In November 2016 I worked for Parkinson's UK to help them find out more about the people who used their services (membership, shops, library, support groups, forum, and donations). During the 5-week project, I downloaded their public Facebook Page and decided to analyse it after our project had ended (5 weeks was clearly not enough to do everything they had in mind).
This post is the first of three posts analysing that dataset.
At the start of our project, we discussed the possibility of using some Machine Learning techniques to identify which users had the Parkinson's condition. This would have allowed us to find different behavioural patterns between those with the Parkinson's condition, their carers and families, health care practitioners, and researchers. One of the main issues with using Machine Learning in this context is the requirement for a pre-classified set. I decided to classify the posts manually, and then to see whether there were different speech patterns between people with Parkinson's (pwp) and others (carers, health care practitioners, or researchers).
PART 1. In this first part, I analyse the differences in language between Parkinson's UK (the owner of the page, which I call PUK here) and their readers (PUKreaders). I found that they had similar centres of interest (diagnosis, medication, research, raising money and awareness) but different priorities (PUK focused on research, PUKreaders on medication). They also used different vocabulary: PUK used 'condition', while PUKreaders used 'disease'.
PART 2. Then, using a similar approach, I will analyse the differences in language between people with Parkinson's and the others.
PART 3. Finally, I will look at the successful posts (those that attracted the most comments, likes, and shares) to find the patterns of success (posting a video, telling a specific story, or discussing a treatment).
from __future__ import division

# Text processing
import json
import re
import string
import collections
from collections import Counter
import nltk
import nltk.collocations
import nltk.corpus
from nltk import bigrams, word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.text import TokenSearcher

# Data handling and storage
import io
import sqlite3
import numpy as np
import pandas as pd

# Render Markdown output inside the notebook
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

### BOKEH
from bokeh.charts import Bar, Scatter, output_file, show
from bokeh.charts.attributes import CatAttr
from bokeh.io import output_notebook, push_notebook
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.models.ranges import Range1d
from bokeh.layouts import gridplot
output_notebook()
0 - Clean the data
I started by removing the stopwords (common English words) using NLTK, the Natural Language Toolkit, and the punctuation using the string Python library. I also merged 'parkinson's uk' into a single token, to be able to separate it from 'parkinson's', and corrected some obvious misspellings that appeared in very common words.
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['https', 'http', 'org', u'“', u'’', u'–', 'www']

# Merge 'parkinson's uk' into one token (for both straight and curly apostrophes),
# normalise plurals, and fix common misspellings. Trailing spaces keep the
# replacements from gluing adjacent words together.
dic_replace = {u"parkinson\u2019s uk ": 'parkinsonsuk ', u"parkinson\u2019s": 'parkinsons',
               "parkinson's uk ": 'parkinsonsuk ', "parkinson's": 'parkinsons',
               'carers ': 'carer ', 'thank you ': 'thanks ', 'weeks ': 'week ',
               'treatmen ': 'treatment ', 'sympto ': 'symptoms ', 'symptom ': 'symptoms ',
               'dads ': 'dad ', 'mums ': 'mum ', 'years ': 'year '}
def tokenize(s):
    return nltk.word_tokenize(s)

def preprocess(s):
    """Lowercase, apply the replacements, tokenize, and keep alphabetic tokens only."""
    s = s.lower()
    for w in dic_replace:
        s = s.replace(w, dic_replace[w])
    tokens = nltk.word_tokenize(s)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    return tokens

def lightclean(s):
    """Apply the replacements and strip punctuation, without tokenizing."""
    s = s.lower()
    for w in dic_replace:
        s = s.replace(w, dic_replace[w])
    for p in punctuation:
        s = s.replace(p, '')
    return s
def cleantext(fname, analysis_name):
    """Count the most frequent words and bigrams in a file of posts and export them."""
    error = 0
    count_stop = Counter()
    count_bigram = Counter()
    with open(fname, 'r') as f:
        for line in f:
            posts = json.loads('{}'.format(line))
            for post in posts:
                try:
                    terms_stop = [term for term in preprocess(post['content'])
                                  if term not in stop]
                    terms_bigram = bigrams(terms_stop)
                except:
                    error += 1
                    continue  # skip posts whose content cannot be processed
                count_stop.update(terms_stop)
                count_bigram.update(terms_bigram)
    nElements = 50
    with open('bigrams_' + analysis_name + '.txt', 'w') as f:
        f.write(str(count_bigram.most_common(nElements)))
    word_freq = count_stop.most_common(nElements)
    # Export the word frequency to json
    with io.open('wordfreq_' + analysis_name + '.json', 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(word_freq, ensure_ascii=False, encoding='utf8')))
cleantext('posts.json','all')
cleantext('posts_puk.json','puk')
cleantext('posts_pukreaders.json','pukreaders')
1 - PUK vs PUKreaders
I divided my dataset into two groups: Parkinson's UK (PUK) and their readers (PUKreaders). I expected that PUK and PUKreaders would not only use different terminology but also have different centres of interest regarding the condition.
Let's first look at the number of posts written:
with io.open('posts_puk.json', encoding='utf-8') as f_puk, io.open('posts_pukreaders.json', encoding='utf-8') as f_pukreaders:
    posts_puk = json.loads(f_puk.read(), encoding='utf8')
    posts_pukreaders = json.loads(f_pukreaders.read(), encoding='utf8')

print "Parkinson's UK wrote", len(posts_puk), 'posts.'

authors = []
content_puk = []
content_pukreaders = []
for i in range(len(posts_pukreaders)):
    authors.append(posts_pukreaders[i]['person_hash_id'])
    content_pukreaders.append(lightclean(posts_pukreaders[i]['content']))
for i in range(len(posts_puk)):
    content_puk.append(lightclean(posts_puk[i]['content']))
authors = set(authors)

print 'Their readers wrote', len(posts_pukreaders), 'posts, written by', len(authors), 'authors; '\
      'which is', round(len(posts_pukreaders) / len(authors), 1), 'posts per author.'
1 - a. Word frequency
I prepared the data to make a bar chart of the 50 most common words in the PUK and PUKreaders posts. To do this, I used plain Python with a Counter() to count the number of times each word appeared in the posts, ordered the list, and took the first 50 words for each group.
with open('wordfreq_puk.json', 'r') as fpuk, open('wordfreq_pukreaders.json', 'r') as fpukreaders:
    words_puk = json.load(fpuk)
    words_pukreaders = json.load(fpukreaders)

freqwords_all = []
freqwords_puk = []
freqwords_pukreaders = []
for word in words_puk:
    freqwords_puk.append(word[0])
    freqwords_all.append(word[0])
for word in words_pukreaders:
    freqwords_pukreaders.append(word[0])
    freqwords_all.append(word[0])

# lists of the 50 most frequent words for PUK, for PUKreaders, and for their union
freqwords_all = list(set(freqwords_all))
freqwords_puk = list(set(freqwords_puk))
freqwords_pukreaders = list(set(freqwords_pukreaders))

# prepare data for the bar chart
df_puk = pd.DataFrame(words_puk, columns=['word', 'PUK'])
df_pukreaders = pd.DataFrame(words_pukreaders, columns=['word', 'PUKreaders'])
df = pd.merge(df_puk, df_pukreaders, on='word', how='outer')
df['diff'] = (df['PUK'] - df['PUKreaders']).fillna(0)
df = df.sort_values(['diff'])
df_plot = pd.melt(df, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])
The following bar chart shows the 50 most used words in each group. It's a Bokeh graph, so you can interact with it: hover to read the underlying data, zoom, and pan. The JavaScript library is still in development, so it's a bit buggy in this Notebook (if you get lost in the zoom, just reload this page). Hopefully this will improve soon.
hover = HoverTool(
    tooltips=[
        ("value", "@y"),
    ]
)
bar = Bar(df_plot, label=CatAttr(columns=['word'], sort=False), values='value',
          tools=[hover, 'pan', 'wheel_zoom'],
          toolbar_location="above",
          stack='variable', title="Frequency of 50 most frequent words",
          width=600, height=300, legend='top_right', bar_width=0.7)
bar.xaxis.major_label_orientation = 20
bar.xaxis.major_label_text_font_size = '8pt'
bar.xaxis.axis_label = None
bar.yaxis.axis_label = None
show(bar, notebook_handle=True);
This bar chart is ordered by the difference between each word's frequency in the two groups' text: on the left, words more present in Parkinson's UK readers' posts; on the right, words more present in Parkinson's UK's posts. In the centre, words with only a green bar are frequent only in Parkinson's UK's posts, while words with only a pink bar are frequent only in the posts of Parkinson's UK's readers.
This shows four clusters:
- The words frequent only in PUK's text
- The words frequent only in PUKreaders' text
- The words frequent in both, more often in PUK
- The words frequent in both, more often in PUKreaders
Here we still consider all the text from each author type.
df_pukonly = df[df['PUKreaders'].isnull()]
df_pukreadersonly = df[df['PUK'].isnull()]
# frequent in both groups: neither column is missing
df_both = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull()]
df_morepuk = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull() & (df['PUK'] > df['PUKreaders'])]
df_morepukreaders = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull() & (df['PUK'] < df['PUKreaders'])]
printmd('**Only frequent in PUKs posts**:')
print ', '.join(str(x) for x in df_pukonly['word'].values)
printmd('**Only frequent in PUKreaders posts**:')
print ', '.join(str(x) for x in df_pukreadersonly['word'].values)
printmd('**Frequent words in both**:')
print ', '.join(str(x) for x in df_both['word'].values)
printmd('**Frequent words in both, more frequent for PUK**:')
print ', '.join(str(x) for x in df_morepuk['word'].values)
printmd('**Frequent words in both, more frequent for PUKreaders**:')
print ', '.join(str(x) for x in df_morepukreaders['word'].values)
From this list, we can see that:
- Readers posting on Parkinson's UK's page express themselves differently: they use polite words such as 'hi', 'thanks', or 'please'.
- They refer to the Parkinson's condition as a 'disease' or 'pd' (for Parkinson's disease), while Parkinson's UK uses the word 'condition'.
- Parkinson's UK frequently uses both 'diagnosis' (noun, factual and general terminology) and 'diagnosed' (verb, emotional and passive), while readers frequently use only 'diagnosed', and even then it is not one of their most frequent words.
- Regarding indefinite pronouns, Parkinson's UK most frequently uses 'something' (along with the word 'things'), while readers of Parkinson's UK most frequently use 'anyone'. This suggests that Parkinson's UK talks more about objects than people, while their readers talk more often about people.
- Both use 'raise', 'support', and 'awareness' at similar rates, making raising awareness their common interest.
Frequent words per author, in both groups
This first analysis considered the entire text of each group, without distinguishing between authors (one person could have repeated the same word many times, making it artificially frequent). Therefore, instead of considering the entire text, I counted, for each frequent word in PUK's readers' posts, the number of distinct authors who mentioned it: each word is counted at most once per author.
I made a scatter plot showing the frequency of words across unique authors, with Parkinson's UK's frequent words on the x axis and their readers' on the y axis. All the words aligned on x=0 or y=0 are frequent in only one of the groups, although that does not mean the other group did not use them.
You can hover over the scatter plot to see the word each circle refers to, as well as zoom in.
with io.open('posts_puk.json', encoding='utf-8') as f:
    puk = json.load(f)
with io.open('posts_pukreaders.json', encoding='utf-8') as f:
    pukreaders = json.load(f)

comp = []
def count_s(s, dataset):
    """Count the posts (for PUK) or the distinct authors (for PUKreaders) mentioning s."""
    count = 0
    if dataset == puk:
        for i in range(len(dataset)):
            if s in dataset[i]['content']:
                count += 1
        comp.append(["Parkinson's UK", s, count, round(100 * count / len(dataset), 1)])
    else:
        authorsaidit = []
        for i in range(len(dataset)):
            if s in dataset[i]['content']:
                if dataset[i]['person_hash_id'] not in authorsaidit:
                    authorsaidit.append(dataset[i]['person_hash_id'])
                    count += 1
        comp.append(["Parkinson's UK readers", s, count, round(100 * count / len(dataset), 1)])
    return comp
for w in freqwords_puk:
    count_s(w, puk)
for w in freqwords_pukreaders:
    count_s(w, pukreaders)
df_comp_all = pd.DataFrame(comp, columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])

# reshape to one row per word, with one percentage column per author type
df_scatter = df_comp_all.copy()
df_scatter = df_scatter[['AuthorType', 'Word', 'Percentage']].set_index(['AuthorType', 'Word'],
                                                                        append=True)
df_scatter = df_scatter.unstack('AuthorType')
df_scatter = df_scatter.stack(0)
df_scatter = df_scatter.reset_index().drop(['level_0', 'level_2'], axis=1)
df_scatter['PUK'] = df_scatter.groupby(['Word'])["Parkinson's UK"].transform('sum')
df_scatter['PUKreaders'] = df_scatter.groupby(['Word'])["Parkinson's UK readers"].transform('sum')
df_scatter = df_scatter.drop_duplicates('Word')
df_scatter = df_scatter.fillna(0)
source = ColumnDataSource(
    data=dict(
        x=df_scatter["PUK"],
        y=df_scatter["PUKreaders"],
        desc=df_scatter["Word"],
    )
)
hover = HoverTool(
    tooltips=[
        ("word", "@desc"),
    ]
)
scatter = figure(plot_width=550, plot_height=400, tools=[hover, 'pan', 'wheel_zoom'],
                 toolbar_location="right")
scatter.circle('x', 'y', size=5, source=source)
scatter.xaxis.axis_label = "Parkinson's UK"
scatter.yaxis.axis_label = "Parkinson's UK readers"
scatter.title.text = "Most frequent words - separating authors"
show(scatter, notebook_handle=True);
I was struck by some words that seemed to come in pairs because they expressed a common concept. I selected five pairs and looked at them more closely.
- 'dad' vs 'mum'
- 'disease' vs 'condition'
- 'help' vs 'support'
- 'diagnosis' vs 'diagnosed'
- 'research' vs 'money'
comp = []
# pairs of words expressing a common concept (the sixth pair is not plotted)
listoflistwords = [['dad', 'mum'], ['disease', 'condition'], ['help', 'support'],
                   ['diagnosis', 'diagnosed'], ['research', 'money'], ['family', 'friends']]

def barplotdata(listoflistwords, plotnb):
    """Fill comp with the counts for the plotnb-th pair of words."""
    w2 = listoflistwords[plotnb - 1]
    for w in w2:
        count_s(w, puk)
        count_s(w, pukreaders)
    return comp
df_comp1 = pd.DataFrame(barplotdata(listoflistwords, 1), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp2 = pd.DataFrame(barplotdata(listoflistwords, 2), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp3 = pd.DataFrame(barplotdata(listoflistwords, 3), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp4 = pd.DataFrame(barplotdata(listoflistwords, 4), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp5 = pd.DataFrame(barplotdata(listoflistwords, 5), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
# dummy dataframe: the sixth grid cell is only used to display the legend
df_comp6 = pd.DataFrame([["PUK", ' ', 0, 0.01], ["PUK readers", ' ', 0, 0.01],
                         ["PUK", ' ', 0, 0.01], ["PUK readers", ' ', 0, 0.01]],
                        columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
bar1 = Bar(df_comp1, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar2 = Bar(df_comp2, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar3 = Bar(df_comp3, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar4 = Bar(df_comp4, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar5 = Bar(df_comp5, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar6 = Bar(df_comp6, values='Percentage',
           width=150, height=200, group='AuthorType')
barlist = [bar1, bar2, bar3, bar4, bar5, bar6]

# same y range on every plot, so the percentages are comparable
for b in barlist:
    b.y_range = Range1d(0, 30)
bar1.yaxis.axis_label = "Percentage of posts"
for b in barlist[1:5]:
    b.yaxis.axis_label = None
    b.xaxis.axis_label = None

# the sixth cell only shows the legend
bar6.axis.visible = False
bar6.ygrid.grid_line_color = None
bar6.outline_line_color = None
bar6.legend.spacing = 10
bar6.legend.padding = 0
bar6.legend.margin = 0
bar6.legend.border_line_color = 'white'

# make a grid
grid = gridplot([[bar1, bar2, bar3], [bar4, bar5, bar6]])
show(grid, notebook_handle=True);
Dad vs Mum
Both PUK and their readers speak more about dads than mums. The Parkinson's condition is usually diagnosed as people get older, and although there are more older women than older men, research shows that women are at lower risk of developing the Parkinson's condition (at least in the Western world; this apparently does not hold in Asian countries). The higher prevalence of Parkinson's among men could therefore explain why male parents are mentioned more often than female parents.
On the other hand, the imbalance is stronger for Parkinson's UK, who mention 'mum' 2.6 times less often than 'dad', than for their readers (1.46 times less often).
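For reference, here is a minimal sketch (not part of the original notebook) of how such a ratio can be read off the df_comp tables built above; the word_ratio helper is hypothetical.
# Hypothetical helper (not in the original notebook): ratio between two
# words' percentages for one author type, read from a df_comp table.
def word_ratio(df, author_type, w1, w2):
    sub = df[df['AuthorType'] == author_type].set_index('Word')
    return round(sub.loc[w1, 'Percentage'] / sub.loc[w2, 'Percentage'], 2)

print word_ratio(df_comp1, "Parkinson's UK", 'dad', 'mum')          # ~2.6
print word_ratio(df_comp1, "Parkinson's UK readers", 'dad', 'mum')  # ~1.46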
Condition vs Disease
Although the Parkinson's condition is commonly called Parkinson's disease, it is not a disease, as you cannot be cured of it, but a condition. Parkinson's UK is therefore careful to use 'condition' rather than 'disease', but the distinction has not yet reached their readers, who use the term 'condition' very infrequently.
Help vs Support
Parkinson's UK speaks more about help and support than their readers do, but the ratio between the two words is similar for Parkinson's UK (help/support = 1.44) and for their readers (1.57).
Diagnosed vs Diagnosis
Here the difference between the two groups is striking: Parkinson's UK mentions 'diagnosis' 1.8 times less often than 'diagnosed', while their readers mention it 5.4 times less often. For Parkinson's UK's readers, being diagnosed therefore matters much more than the diagnosis itself.
Money vs Research
Parkinson's UK talks much more frequently about research than money, while their readers talk more about money and comparatively little about research (although it is still a frequent word, and therefore an interest of theirs).
1 - b. Context
With a Natural Language Processing package such as NLTK, it is now possible to look into the context in which these words are used. I first cut each group's text into 'tokens', creating a corpus that NLTK can use to perform some common analyses, and wrote a function that finds unique expressions matching a pattern.
text_puk = ' | '.join(x.lower() for x in content_puk)
text_pukreaders = ' | '.join(x.lower() for x in content_pukreaders)
textnltk_puk = nltk.Text(word_tokenize(text_puk))
textnltk_pukreaders = nltk.Text(word_tokenize(text_pukreaders))

def find_unique_exp(text, exp):
    """Return the unique token sequences in text matching the TokenSearcher pattern exp."""
    uniqu = []
    match_tokens = TokenSearcher(text).findall(exp)
    for x in match_tokens:
        uniqu.append(' '.join(x))
    return ', '.join(str(x.encode('utf-8')) for x in list(set(uniqu)))
Collocations
I first looked at the collocations, that is, the words that often appear together. We take the entire corpus and find all the pairs of words that frequently appear side by side.
printmd("**Parkinson's UK**")
print textnltk_puk.collocations()
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.collocations()
First, we can see that in the Parkinson's UK corpus some of the words that appear together refer to specific events or people. Parkinson's UK writes long posts about them in which these names are repeated, so the pairs score highly even though the names are not actually that frequent across the corpus.
When we compare the two, we see that Parkinson's UK talks about raising both 'money' and 'awareness'. Unsurprisingly, their readers are more interested in raising money, although 'awareness week' also appeared frequently as a pair.
Finally, the word pairs were more positive in the Parkinson's UK corpus, e.g. 'big difference' or 'better treatments', while their readers also talked about 'passed away' or 'mental health'.
Unique expressions
PUK and PUKreaders mentioned dad and mum at different rates, though both groups mentioned 'dad' more often than 'mum'. The contexts in which the parents appear, however, could be quite different.
I first looked at the 'unique expressions' that included any variation of the female parent (mum, mums, mother, and mothers), and the same for the male parent (dad, dads, father, and fathers).
printmd('**Unique expressions of PUK for the female parent:**')
print find_unique_exp(textnltk_puk, r"<.*> <mum> | <.*> <mums> | <.*> <mother> | <.*> <mothers>")
printmd('**and for the male parent:**')
print find_unique_exp(textnltk_puk, r"<.*> <dad> | <.*> <dads> | <.*> <father> | <.*> <fathers>")
printmd('------------------')
printmd('**Unique expressions of PUK readers for the female parent:**')
print find_unique_exp(textnltk_pukreaders, r"<.*> <mum> | <.*> <mums> | <.*> <mother> | <.*> <mothers>")
printmd('**and for the male parent:**')
print find_unique_exp(textnltk_pukreaders, r"<.*> <dad> | <.*> <dads> | <.*> <father> | <.*> <fathers>")
Since PUK's posts feature news, stories, and interviews, the male and female parents appear there with varied pronouns: 'my', 'his', 'their'... This shows that the texts' authors have multiple relationships to the parent being discussed. On the other hand, authors who are not Parkinson's UK essentially talk about their own parent, highlighting that they share a similar relationship to the parents they mention.
Interestingly, the female parent, who was mentioned less often than the male parent, seems to be associated with multiple positive adjectives: best, lovely, proud, precious, amazing.
Concordance
To better understand the context in which words are used, it is also possible to look at the concordance: the occurrences of a word, in their context. The concordance centres the word in focus, highlighting the context in which it is used.
printmd("**Parkinson's UK**")
print textnltk_puk.concordance('mum')
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.concordance('mum')
printmd("**Parkinson's UK**")
print textnltk_puk.concordance('dad')
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.concordance('dad')
Here again, we see that for Parkinson's UK most discussions involving parents are positive and uplifting stories. Their readers, on the other hand, tell more ambivalent stories: they express positive feelings towards their parent, but sometimes tell sad stories.
Bigrams
I finally looked at bigrams: pairs of words in which one specific keyword appears. With this technique, it is possible to focus on one keyword and to quantify the words it appears with.
I wanted to explore the relationship that readers have with specific members of their family, so I looked at the words most associated with 'my'.
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(textnltk_puk)
finder_pukreaders = nltk.collocations.BigramCollocationFinder.from_words(textnltk_pukreaders)
scored = finder.score_ngrams(bgm.likelihood_ratio)
scored_pukreaders = finder_pukreaders.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

# Same for the readers' corpus.
prefix_keys_pukreaders = collections.defaultdict(list)
for key, scores in scored_pukreaders:
    prefix_keys_pukreaders[key[0]].append((key[1], scores))
for key in prefix_keys_pukreaders:
    prefix_keys_pukreaders[key].sort(key=lambda x: -x[1])
printmd('**MY**')
bigram_my = prefix_keys['my'][:30]
bigram_my_pukreaders = prefix_keys_pukreaders['my'][:30]
df_bg_my = pd.DataFrame(bigram_my, columns=['word', 'PUK'])
df_bg_my_pukreaders = pd.DataFrame(bigram_my_pukreaders, columns=['word', 'PUKreaders'])
df_bg_both_my = pd.merge(df_bg_my_pukreaders, df_bg_my, on='word', how='outer')
df_bg_both_my['diff'] = df_bg_both_my['PUK'] - df_bg_both_my['PUKreaders']
df_bg_both_my = df_bg_both_my.sort_values('diff')
df_bg_both_my['word'] = df_bg_both_my['word'].str.encode('utf-8')
df_plot_bg = pd.melt(df_bg_both_my, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])

bar = Bar(df_plot_bg, label=CatAttr(columns=['word'], sort=False), values='value',
          stack='variable', title="Bigrams for 'my'",
          width=550, height=300, legend='top_right', bar_width=0.7)
bar.xaxis.major_label_orientation = 20
bar.xaxis.major_label_text_font_size = '8pt'
show(bar, notebook_handle=True);
'Dad' is by far the word most associated with 'my' in the readers' corpus. Parkinson's UK, however, associates 'my' equally with 'dad' and 'wife', then with 'mum' and 'husband'. This suggests we could also look at the difference between husband and wife to find other trends.
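As a quick first look at that trend, here is a small sketch (not in the original notebook) reusing the count_s helper defined earlier; the comparison words are my own choice.
# Compare how often each group mentions 'husband' and 'wife' (a follow-up
# check that was not part of the original analysis).
comp = []
for w in ['husband', 'wife']:
    count_s(w, puk)
    count_s(w, pukreaders)
df_spouse = pd.DataFrame(comp, columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
print df_spouse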
In the list of bigrams, 'new' caught my attention. I thought it would help find what expectations and hopes the authors have.
printmd('**NEW**')
bigram_new = prefix_keys['new'][:20]
bigram_new_pukreaders = prefix_keys_pukreaders['new'][:20]
df_bg_new = pd.DataFrame(bigram_new, columns=['word', 'PUK'])
df_bg_new_pukreaders = pd.DataFrame(bigram_new_pukreaders, columns=['word', 'PUKreaders'])
df_bg_both = pd.merge(df_bg_new_pukreaders, df_bg_new, on='word', how='outer')
df_bg_both['diff'] = df_bg_both['PUK'] - df_bg_both['PUKreaders']
df_bg_both = df_bg_both.sort_values('diff')
df_plot_bg_new = pd.melt(df_bg_both, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])

bar_bg_new = Bar(df_plot_bg_new, label=CatAttr(columns=['word'], sort=False), values='value',
                 stack='variable', title="Bigrams for 'new'",
                 width=550, height=300, legend='top_right', bar_width=0.7)
bar_bg_new.xaxis.major_label_orientation = 20
bar_bg_new.xaxis.major_label_text_font_size = '8pt'
show(bar_bg_new, notebook_handle=True);
Parkinson's UK was more interested in new research, studies, and laws, while their readers talked more about new treatments or medication, as well as products or formulas. This suggests that Parkinson's UK focuses on the future, by finding new ways to improve the lives of people with Parkinson's, while their readers focus on the present and discuss new medication.
I was surprised to find 'new lawn' among PUK readers' bigrams, so I looked at the concordance and found that only one person had talked about a new lawn. In fact, in our text, bigrams with scores below 20 are bigrams that appear only once, so they can be misleading: they should not be read as frequent, merely as present.
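This is easy to verify: the collocation finder keeps the raw bigram counts in its ngram_fd frequency distribution. A minimal sketch of the check (not in the original notebook):
# Raw count of the bigram in the readers' corpus; FreqDist returns 0
# for unseen keys, and here we expect a count of 1.
print finder_pukreaders.ngram_fd[('new', 'lawn')]
# The concordance then shows the single post mentioning it.
textnltk_pukreaders.concordance('lawn')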
1 - c. Conclusion of Part 1
In this post I have used multiple Natural Language Processing tools:
- Most frequent words
- Collocations
- Unique expressions
- Concordance
- Bigrams
These tools helped analyse Facebook posts from two different groups: Parkinson's UK and their readers. The analysis highlighted their common and diverging interests and priorities, but also an interesting gender imbalance in how people are affected by Parkinson's.
Interests and priorities
Parkinson's UK and their readers shared similar interests: diagnosis, raising awareness, raising money, research, and medication. However, priorities differed slightly, with Parkinson's UK focusing on research and their readers on medication. Parkinson's UK also presented positive stories, while their readers talked positively about their parents but used more negative expressions. I did not conduct any sentiment analysis on this text, but it would probably find a similar pattern.
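If one wanted to test that intuition, here is a minimal sketch using NLTK's VADER sentiment analyser; this tool was not used in the original project, and the average compound score is just one rough way to compare the two groups.
# A rough comparison (not part of the original analysis): average VADER
# 'compound' score per group, using the content lists built earlier.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
for name, texts in [('PUK', content_puk), ('PUKreaders', content_pukreaders)]:
    scores = [sia.polarity_scores(t)['compound'] for t in texts]
    print name, 'mean compound sentiment:', round(sum(scores) / len(scores), 3)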
Gender imbalance
Two main factors influence the gender of those who are talked about on Facebook: first, who develops Parkinson's; and second, who posts on Facebook. Recent research finds a higher prevalence of Parkinson's in males than females, possibly due to the role of hormones in the development of the condition; therefore more 'dads' might be talked about. On the other hand, as Parkinson's UK often stresses, those affected by Parkinson's are not just those who develop it, but also their whole families. This is supported by a reading of the text, which shows that when a person talks about their 'dad', they also mention their 'mum': in this case the dad is the person with Parkinson's, while the mum is the carer/family member. During the project, this was quickly flagged as a possible issue in applying Natural Language Processing tools. Finally, when comparing males and females of the same generation, we should remember that women might share more than men on Facebook; therefore more husbands are likely to be mentioned, regardless of Parkinson's prevalence in the population.